ABSTRACT

A method and system for aiding users in visualizing the relatedness of retrieved text documents and the topics to which they relate comprises training a classifier by semantically analyzing an initial group of manually-classified documents, positioning the classes and documents in two-dimensional space in response to semantic associations between the classes, and displaying the classes and documents. The displayed documents may be retrieved by an information storage and retrieval subsystem in any suitable manner.
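
By way of illustration only, the training and scoring steps summarized above can be sketched in Python. The sketch is not part of the disclosed embodiment. The function names, the input format, the toy documents, and the constant k = 0.01 are assumptions chosen for the example. The per-term expression (p - q)^2 / (p + q) is one common form of a chi-squared difference and is assumed here because the text does not spell out the exact expression. Positioning the classes in two dimensions would additionally require a multidimensional scaling routine, which is omitted.

```python
import math
from collections import Counter, defaultdict


def term_statistics(labeled_docs):
    """Count term usage per class from manually classified documents.

    labeled_docs: iterable of (terms, classes) pairs (hypothetical input format).
    Returns (f_tc, f_c): per-class term counts and per-class total term counts.
    """
    f_tc = defaultdict(Counter)
    for terms, classes in labeled_docs:
        for c in classes:
            f_tc[c].update(terms)
    f_c = {c: sum(counts.values()) for c, counts in f_tc.items()}
    return f_tc, f_c


def semantic_difference(f_tc, f_c, ci, cj):
    """Chi-squared style difference between two classes' term distributions.

    Assumed formula: sum over terms of (p - q)^2 / (p + q), where p and q are
    the class conditional probabilities F_tc / F_c. Terms absent from both
    classes never enter the loop, so they are ignored.
    """
    total = 0.0
    for t in set(f_tc[ci]) | set(f_tc[cj]):
        p = f_tc[ci][t] / f_c[ci]
        q = f_tc[cj][t] / f_c[cj]
        total += (p - q) ** 2 / (p + q)
    return total


def class_scores(doc_terms, f_tc, f_c, k=0.01):
    """Score a document against every class by summing ln(P / F_c) over its terms.

    A zero term frequency is replaced by the small constant k (assumed value).
    """
    scores = {}
    for c in f_c:
        score = 0.0
        for t in doc_terms:
            p = f_tc[c][t] or k
            score += math.log(p / f_c[c])
        scores[c] = score
    return scores


# Toy training set (illustrative data only).
training = [
    (["star", "orbit", "star"], ["Astronomy"]),
    (["acid", "atom"], ["Chemistry"]),
    (["atom", "star"], ["Astronomy", "Chemistry"]),
]
f_tc, f_c = term_statistics(training)
difference = semantic_difference(f_tc, f_c, "Astronomy", "Chemistry")
scores = class_scores(["star", "orbit"], f_tc, f_c)
best_class = max(scores, key=scores.get)
```

With this toy data, a document containing "star" and "orbit" receives its greatest score for "Astronomy". A smaller `difference` between two classes indicates more similar term usage, so such classes would be placed closer together once a scaling method assigns coordinates.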

22 Claims, 9 Drawing Sheets 
METHOD AND SYSTEM FOR TWO-DIMENSIONAL VISUALIZATION OF AN INFORMATION TAXONOMY AND OF TEXT DOCUMENTS BASED ON TOPICAL CONTENT OF THE DOCUMENTS

BACKGROUND OF THE INVENTION

The present invention relates generally to information search and retrieval systems and, more specifically, to a method and system for displaying visual representations of retrieved documents and the topics to which they relate.

Information search and retrieval systems locate documents stored in electronic media in response to queries entered by a user. Such a system may provide multiple entry paths. For example, a user may enter a query consisting of one or more search terms, and the system searches for any documents that include the terms. Alternatively, a user may select a topic and the system searches for all documents classified under that topic. Topics may be arranged in accordance with a predetermined hierarchical classification system. Regardless of the entry path, the system may locate many documents, some of which may be more relevant to the topic in which the user is interested and others of which may be less relevant. Still others may be completely irrelevant. The user must then sift through the documents to locate those in which the user is interested.

Systems may aid the user in sifting through the retrieved documents and using them as stepping stones to locate other documents of interest. Commercially available systems are known that sort the retrieved documents in order of relevance by assigning weights to the query terms. If the query accurately reflects the user's topic of interest, the user may quickly locate the most relevant documents.

Systems are known that incorporate "relevance feedback." A user indicates to the system the retrieved documents that the user believes are most relevant, and the system then modifies the query to further refine the search. For a comprehensive treatment of relevance ranking and relevance feedback, see Gerard Salton, editor, The Smart Retrieval System-Experiments in Automatic Document Processing, N.J., Prentice Hall, 1971; Gerard Salton, "Automatic term class construction using relevance--a summary of work in automatic pseudoclassification," Information Processing & Management, 16:1-15, 1980; Gerard Salton et al., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.

Practitioners in the art have also developed systems for providing the user with a graphical representation of the relevance of retrieved documents. In the Adaptive Information Retrieval (AIR) system, described in R. K. Belew, Adaptive Information Retrieval: Machine Learning in Associative Networks, Ph.D. thesis, The University of Michigan, 1986, objects that include documents, keywords and authors are represented by nodes of a neural network. A query may include any object in the domain. The system displays dots or tokens on a video display that represent the nodes corresponding to the objects in the query. The system also displays tokens that represent nodes adjacent to those nodes and connects these related nodes with arcs. In another system, known as Visualization by Example (VIBE), described in Kai A. Olson et al., "Visualization of a document collection: The VIBE system," Technical Report LIS033/IS91001, School of Library and Information Science, University of Pittsburgh, 1991, a user selects one or more points of interest (POIs) on a video display. The user is free to place the POIs anywhere on the screen. The user assigns a set of keywords to each POI. The system then retrieves documents and positions them between POIs to which they are related. The system determines the relatedness between a document and a POI in response to the frequency with which the keywords corresponding to the POI occur in the document. The system thus displays tokens representing similar documents near one another on the screen and tokens representing less similar documents farther apart.

[illegible] classify documents in an information retrieval system under a predetermined set of classes or a predetermined hierarchical taxonomy to aid searching. The objective in text classification is to analyze a document and determine its topical content with respect to a predetermined set of candidate topics. In a typical system, a computer executes an algorithm that statistically analyzes a set of classified documents, i.e., documents that have been classified by a human, and uses the resulting statistics to build a characterization of prototypical documents for a class. Then, the system classifies each new document to be stored in the system, i.e., an [illegible] document that has not been previously classified, by determining the statistical similarity of the document to the [illegible] prototype. Text classification methods include nearest-neighbor classification and Bayesian classification in which the features of the Bayesian classifier are the occurrence of terms in the documents.

It would be desirable to simultaneously visualize both the relatedness between text documents and classes and the relatedness between the classes themselves. These problems and deficiencies are clearly felt in the art and are solved by the invention in the manner described below.

SUMMARY OF THE INVENTION

The present invention comprises an electronic document storage and retrieval subsystem, a document classifier, and a visualization subsystem. The present invention aids users in visualizing the relatedness of retrieved text documents and the topics to which they relate.

Documents stored in the retrieval subsystem are manually classified, i.e., categorized by human operators or editors, into a predetermined set of classes or topics. An automatic classifier is constructed by calculating statistics of term usage in the manually classified texts. This step trains the classifier or places the classifier in a state in which it can automatically, i.e., without intervention by human operators or editors, classify additional documents. The classifier then applies a suitable statistical text analysis method to the classes to determine the semantic relatedness between each pair of classes. The visualization subsystem uses a suitable multidimensional scaling method to determine the positions in two-dimensional space of the classes in response to the relatedness information. The resulting collection of positions, referred to herein as a semantic space map, thus represents the semantic relatedness between each class and every other class by the spatial distance between them. The visualization subsystem displays a token or icon representing each class on a suitable output device, such as a video display. The classes represented by tokens spatially closer to one another are more related than those represented by tokens spatially farther from one another. (For purposes of convenience and brevity, the term "classes" may be used synonymously hereinafter in place of the phrase "tokens or icons representing the classes.")

The visualization subsystem populates the semantic space map with the (manually classified) documents with which
the classifier was trained as well as with new, i.e., automatically classified, documents. The classifier produces a set of class scores for each document, one score corresponding to each class. The visualization subsystem then positions the document in two-dimensional space relative to the classes.

The present invention may be used to aid a researcher or user in navigating through the documents. In response to a query entered by the user, the document retrieval system searches for a document using any suitable search method. The system then uses the populated semantic space map to retrieve the positions of the documents found in the search and displays tokens or icons representing those documents. (For purposes of convenience and brevity, the term "documents" may be used synonymously hereinafter in place of the phrase "tokens or icons representing the documents.") A user can rapidly focus in on the documents most related to the topics of interest by selecting and reading only the documents nearest those topics. The user can also browse through related documents by selecting documents in close proximity to each other on the display. By reading a few of the documents in close proximity to each other, the user can rapidly conclude that a particular cluster of documents is of interest or is not of interest without reading all of them. Moreover, the user can select and read a sampling of documents located in different directions around a particular location to determine the general direction in which the most interesting documents are likely to be located. The user can successively browse through documents located in that general direction until the user has located the desired documents or discovers a different direction in which interesting documents are likely to be located. The user can thus visualize the progress of the research by noting the "path" along which the user has browsed through documents and the spatial relationship of those documents to the topics.

The foregoing, together with other features and advantages of the present invention, will become more apparent when referring to the following specification, claims, and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following detailed description of the embodiments illustrated in the accompanying drawings, wherein:

FIG. 1 is a diagrammatic illustration of the system of the present invention;

FIG. 2 is a diagrammatic illustration of a tagged or classified document;

FIG. 3 is a diagrammatic illustration of a taxonomy into which the present invention classifies documents;

FIG. 4 is a flow diagram illustrating the method of the present invention;

FIG. 5 is a flow diagram illustrating a method for generating term frequency statistics;

FIG. 6 is a flow diagram illustrating a method for generating a semantic association between every pair of classes;

FIG. 7 is a flow diagram illustrating a method for positioning classes of a flat taxonomy in two-dimensional space;

FIGS. 8 and 8A-8C are a flow diagram illustrating a method for positioning classes of a hierarchical taxonomy in two-dimensional space;

FIG. 9 is a flow diagram illustrating a method for classifying a document;

FIG. 10 is a flow diagram illustrating a method for positioning a document in two-dimensional space;

FIG. 11 illustrates an exemplary output of the system on a video display terminal, showing the upper level of a hierarchical taxonomy; and

FIG. 12 illustrates an exemplary output of the system on a video display terminal, showing a lower level of the hierarchical taxonomy.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

As illustrated in FIG. 1, a document storage and retrieval subsystem 10 comprises a search and retrieval controller 12, a disk 14 on which a database of documents is stored, and a video display terminal 16. A user (not shown) may input search queries via video display terminal 16. Unit 12 searches disk 14 for documents that are responsive to the query and displays the documents on terminal 16 in the manner described below. Although in the illustrated embodiment documents are stored locally to subsystem 10 on disk 14, in other embodiments the documents may be retrieved via a network (not shown) from remote locations. Terminal 16 displays the output of subsystem 10 to the user in a [illegible] format. A classification subsystem 18 classifies documents stored in subsystem 10 into one of a set of predetermined classes. A visualization subsystem 20 positions the classes on the screen of terminal 16 in a manner such that the distance between every two classes is representative of the extent to which those classes are related to each other relative to other classes. Visualization subsystem 20 also positions the retrieved documents on the screen of terminal 16 in a manner such that the distance between each document and a class is representative of the extent to which the document is related to that class relative to other classes. As noted above, the word "class," as used herein in the context of displayed output, denotes a displayed token or icon that represents the class. Similarly, the word "document," as used herein in the context of displayed output, denotes a displayed token or icon that represents the document. Subsystem 10 may output the full document or a [illegible].

Subsystems 10, 18 and 20 may comprise any suitable combination of computer hardware and software that performs the methods described below. Persons of skill in the art will readily be capable of designing suitable software and/or hardware or obtaining its constituent elements from commercial sources in view of the teachings herein.

As illustrated in FIG. 2, a document includes text, which comprises a set of terms (t). A term may be a word or phrase. A document of the type commonly stored in information storage and retrieval systems and the type with which the present invention is concerned, such as a news article, generally comprises many different terms. A person can read the text and categorize the document as relating to one or more of the classes. Some of the terms, such as the word "the," may be irrelevant to all classes. It should be understood that irrelevant or less relevant terms may be excluded from the set of terms used in the [illegible] of the present invention. The document also includes an identifying number (d) and class numbers (c). The document text, its identifier (d) and classes (c) may be organized, stored and accessed by the software and hardware in any suitable manner.

As illustrated in FIG. 3, the classes into which a document may be categorized define a taxonomy. The taxonomy may be hierarchical, as indicated by the combined portions of FIG. 3 in solid and broken line. The root 22 represents the most general level of knowledge in the subject with which
the database is concerned. If the database concerns all topics of knowledge, the root represents "All Knowledge." Each successive level of the hierarchy represents an increasing level of specificity. Classes 24, 26 and 28 are equally specific. If the database concerns all topics, classes 24, 26 and 28 may, for example, represent "Science," "Mankind," and "Religion," respectively. Classes 30, 32 and 34, which are subclasses of class 24, may, for example, represent "Astronomy," "Chemistry," and "Electronics," respectively. Alternatively, the taxonomy may be flat, as indicated by the portion of FIG. 3 in solid line. Each of the classes 24, 26 and 28 has equal specificity. If suitable class labels are chosen, any document in any database may be categorized into one or more of those classes.

As illustrated in FIG. 4, the present invention comprises the step 30 of generating a semantic space map, the step 32 of populating the semantic space map with documents in the database, and the step 34 of displaying the classes and retrieved documents in accordance with their positions on the populated semantic space map. The semantic space map, which defines the relative spatial positions of the classes and documents, is produced in response to statistical properties of the terms in the documents. An assumption underlying the present invention is that the topical relatedness of documents corresponds to their semantic relatedness.

Step 30 of generating a semantic space map comprises the step 36 of training a Bayesian classifier, which is described in further detail below. Step 36 comprises the step 38 of manually classifying each document in a database 40 into one of the predetermined classes. In other words, a person or editor reads the document and decides the class or classes to which the document is most related. If the taxonomy is hierarchical, the person performing this classification must decide how general or specific the document is. For example, if the document relates to an overview of science, the person might classify it in "Science." If the document relates to the Moon, the person might classify it in "Astronomy." A document may be classified in multiple classes, such as both "Science" and "Astronomy." The document is tagged with the chosen class or classes, as described above with respect to FIG. 2.

Step 36 also comprises the step 42 of generating term frequency statistics 44 in response to the manually classified documents 46. The term frequency statistics include the frequency of each term in the documents in database 46 in each class, F_{t,c}. These statistics may be conceptually represented by a two-dimensional array having one dimension of a size equal to the total number of unique terms in the collection of documents in database 46 and having the other dimension of a size equal to the total number of classes. The term frequency statistics also include the total term frequency in each class, F_c. These statistics may be conceptually represented by a uni-dimensional array of a size equal to the number of classes.

Step 42 comprises the steps illustrated in FIG. 5. At step 48, each F_{t,c} is initialized to zero. At step 50, the first document (d) is selected from database 46. At step 52, the first term (t) is selected. At step 54, the first class (c) is selected. At step 56, it is determined whether the document (d) is classified in the class (c). If document (d) is classified in the class (c), F_{t,c} is incremented at step 58. If document (d) is not classified in the class (c), the method proceeds to step 60. At step 60, it is determined whether all of the classes have been processed. If the last class has not been processed, the method proceeds to step 62. At step 62, the next class is selected, and the method returns to step 56. If all classes have been processed, the method proceeds to step 64. At step 64, it is determined whether all terms in the document (d) have been processed. If the last term in the document (d) has not been processed, the method proceeds to step 66. At step 66, the next term (t) is selected, and the method returns to step 54. If all terms in the document (d) have been processed, the method proceeds to step 68. At step 68, it is determined whether all documents in database 46 have been processed. If the last document in database 46 has not been processed, the method proceeds to step 70. At step 70, the next document (d) is selected, and the method returns to step 52. If all documents in database 46 have been processed, the method proceeds to step 70. At this point in the method, the frequency of each term in the documents in database 46 in each class, F_{t,c}, has been calculated.

The method then calculates the total term frequency in each class, F_c, by summing F_{t,c} over all terms that occur in documents of that class (c). At step 70, the first class (c) is selected. At step 72, F_c, the total term frequency in the selected class (c), is initialized to zero. At step 74, the first term (t) is selected. At step 76, the selected F_{t,c} is added to F_c. At step 78, it is determined whether all terms have been processed. If the last term has not been processed, the method proceeds to step 80. At step 80, the next term (t) is selected, and the method returns to step 76. If all terms have been processed, the method proceeds to step 82. At step 82, it is determined whether all classes have been processed. If the last class has not been processed, the method proceeds to step 84. At step 84, the next class (c) is selected, and the method returns to step 72. Step 42 (FIG. 4) of generating term frequency statistics is completed when all classes have been processed.

Referring again to FIG. 4, step 30 of generating a semantic space map also comprises the step 86 of generating a semantic association, S_{i,j}, between each pair of classes, class c_i and class c_j. These semantic associations may be conceptually considered as a two-dimensional array or matrix in which each dimension has a size equal to the total number of classes and in which the diagonal of the array represents the semantic association between each class and itself. The semantic association represents the extent to which the documents of each pair of classes are semantically related. Classes that use terms in similar frequency are more semantically related than those with dissimilar frequencies. Step 86 uses the term frequency statistics to determine the semantic association between each pair of classes. As described below, the term frequency statistics define class conditional probability distributions. The class conditional probability of a term, (F_{t,c}/F_c), is the probability of the term (t) given the class (c). The probability distribution of a class (c) thus defines the probability that each term occurs in that class (c). The semantic association between each pair of classes is defined by the chi-squared difference between their probability distributions.

Step 86 comprises the steps illustrated in FIG. 6. The first pair of classes is selected. At step 88, a class c_i is selected. At step 90, a class c_j is selected. The method successively selects pairs of classes at steps 88 and 90 until all unique pairs have been processed. The order in which the classes are selected is not critical. At step 92, S_{i,j} is initialized to zero. At step 94, the first term (t) is selected. Each pair of classes is processed using each term. At step 96, S_{i,j} is incremented by an amount equal to the chi-squared difference between the class conditional probabilities of the term (t). At step 98, it is determined whether the frequency of the term (t) in both classes is zero. If the frequency is zero, the chi-squared measure is ignored, and the method proceeds to step 100. At step 100, it is determined whether all terms have been
processed. If the last term has not been processed, the method proceeds to step 102. At step 102, the next term (t) is selected, and the method returns to step 98. If all terms have been processed, the method proceeds to step 104. At steps 104 and 106, it is determined whether all classes have been processed. If the last class has not been processed, the method returns to step 88 or 90. Step 86 (FIG. 4) of generating a semantic association between each pair of classes is completed when all unique pairs of classes have been processed.

Referring again to FIG. 4, step 30 of generating a semantic space map also comprises the step 108 of positioning the classes in two-dimensional space. As described below, the method uses a suitable non-metric multidimensional scaling (MDS) method, such as that described in J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, 29(1):1-27, March 1964 and J. B. Kruskal, "Nonmetric multidimensional scaling: A numerical method," Psychometrika, 29:115-129, 1964. The MDS transforms the semantic associations between the pairs of classes into spatial coordinates of the classes.

In an embodiment of the present invention in which the classes are arranged in a flat taxonomy, step 108 comprises the steps illustrated in FIG. 7. At step 110, the first class (c) is selected. At step 112, a corresponding random point, (x_c, y_c), is generated. At step 114, it is determined whether all classes have been assigned random points. If the last class has not been assigned a random point, the method proceeds to step 116. At step 116, the next class (c) is selected, and the method returns to step 112. If all classes have been assigned random points, the method proceeds to step 118. At step 118, MDS is performed. The MDS method uses the points (x_c, y_c) as initial positions and scales them using the semantic associations S_{i,j} as inter-point similarities. Scaling is the optimization of the goodness-of-fit between the configuration and the inter-point similarities to produce an optimized configuration that better matches the inter-point similarities. The optimization method known as Conjugate Gradient is preferred. Conjugate Gradient, a method well-known to those of skill in the art, is described in William H. Press et al., Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1988. The stress function known as Kruskal's stress is preferred, although Guttman's Point Alienation is also suitable. At step 122, the configuration stress is recorded. The method performs steps 110-122 for n iterations, where n is a number preferably greater than about 10. Iteratively performing steps 110-122 maximizes the likelihood that the lowest stress is a good fit to the inter-point similarities S_{i,j} and is not the result of a poor initial point selection. At step 124, the method determines whether n iterations have been performed. At step 126, the point configuration having the lowest recorded stress is selected and recorded in the semantic space map 128 (FIG. 4).

In an embodiment of the present invention in which the classes are arranged in a hierarchical taxonomy, step 108 comprises the steps illustrated in FIG. 8. FIG. 8 is particularly arranged to illustrate traversal of the hierarchy using recursive software, although persons of skill in the art will be capable of implementing the method non-recursively in view of these teachings. At step 130, an initial node (T) in the hierarchy is selected and set to a value that represents the hierarchy root node. At step 132, the first level (L) below the root node of the hierarchy is selected. At step 134, an initial point (P) is set to a value of (0,0), which represents the center of the semantic space map. At step 136, a scale (s) is initialized to a value of one. At step 138, an array (D) is initialized by setting its values (D_{i,j}) equal to those of the array of semantic associations, S_{i,j}, at level L (i.e., classes L_j are those at level L subsumed by T).

At step 156, the first class (c) at level L is selected. At step 158, a corresponding random point (x_c, y_c) is generated. At step 160, it is determined whether all classes have been assigned random points. If the last class has not been assigned a random point, the method proceeds to step 162. At step 162, the next class (c) is selected, and the method returns to step 158. If all classes have been assigned random points, the method proceeds to step 164. At step 164, MDS is performed. The MDS method uses the points (x_c, y_c) as initial positions and scales them using the modified semantic associations D as inter-point similarities. At step 168, the configuration stress is recorded. For the reasons described above with respect to FIG. 7, the method performs steps [illegible] for n iterations, where n is a number preferably greater than about 10. At step 170, the method determines whether n iterations have been performed. At step 172, the point configuration having the lowest recorded stress is selected.

At step 174, the selected configuration of points is scaled to center it on point P with a maximum radius equal to the scale s. At step 180, the scaled configuration of points representing the classes at level L is recorded in the semantic space map. As the steps described below indicate, if there are more levels the method will recurse. When the last level has been reached, the method processes another class at that level until all levels and classes have been processed and the method has thus traversed the hierarchy.

The method then proceeds to traverse the next level in the hierarchy. At step 182, the first class (c) at level L is selected. At step 184, the node T is set to class c. At step 185, it is determined whether there is another level L+1 under T. If so, processing continues with step 186. If not, processing continues with step 196. At step 186, point P is set to the position of the class c at level L determined at step 174. Steps 188 and 190 change the scale. At step 188, a number r is set equal to the smallest distance between any two points at level L. At step 190, the scale (s) is set equal to r/h, where h is a constant controlling the amount of separation between classes in the semantic map and preferably has a value between 10 and 100. At step 192, the level (L) is incremented. At step 194, the method recurses. After recursion, processing continues at step 196. At step 196, it is determined whether there are more classes at level L. If the last class has not been processed, the method proceeds to step 198. At step 198, the next class at level L is selected, and the method returns to step 184. If the last class has been processed, all classes in the hierarchy have been positioned in the semantic space map.

Referring again to FIG. 4, the step 32 of populating the semantic space map with documents in the database comprises the step 200 of classifying the documents and the step 202 of positioning the documents in the semantic space map. A maximum likelihood estimation determines the most probable class for a document using the term frequency statistics generated at step 42. The database may include new documents 204 as well as the manually-classified initial documents 40.

Step 200 comprises the steps illustrated in FIG. 9. The method classifies a document (d) in a class (c) by estimating the semantic association between the document and each class and selecting the class to which the association is greatest. At step 206, a first class c is selected. At step 208,
a class score P_{c,d} is initialized to zero. At step 210, the first
term t in document d is selected. Bayesian classifiers require a
non-zero weight for each term. Therefore, if the frequency (F_{t,c})
of the term t is zero in class c, it is replaced with a predetermined
constant K that is small in relation to the frequency of the
remaining terms. At step 212, it is determined whether F_{t,c} is
zero. If F_{t,c} is zero, at step 214, a modified frequency F'_{t,c}
is set equal to the constant K, which is preferably between about
0.001 and 1.0. This range for constant K was empirically estimated
based on cross-validation experiments in which the classifier was
trained using a portion of the documents in the database and then
used to classify the remaining documents. This range for constant K
yielded acceptable classification results. If F_{t,c} is non-zero, at
step 216, the modified frequency F'_{t,c} is set equal to F_{t,c}. At
step 218, the natural logarithm of the ratio of F'_{t,c} to F_c is
added to P_{c,d}. At step 220, it is determined if all terms t in
document d have been processed. If the last term t has not been
processed, the method proceeds to step 222. At step 222, the next
term t is selected, and the method returns to step 212. If all terms
t have been processed, the method proceeds to step 224. At step 224,
it is determined whether all classes c have been processed. If the
last class c has not been processed, the method proceeds to step 226.
At step 226, the next class c is selected, and the method returns to
step 208. Classification is complete when all classes c have been
processed. The result of classification step 200 (FIG. 4) is a set of
class scores P_{c,d} for document d.

As illustrated in FIG. 11, retrieved documents 238 may be displayed
on video display terminal 16 along with the classes to which they
relate. In the example described above with respect to FIG. 3, three
classes 24, 26 and 28 are designated "Science", "Mankind" and
"Religion", respectively. As described above, the relative positions
of classes 24, 26 and 28 and documents 238 represent the extent to
which they are related to one another. Classes and documents that are
closer to one another are more related, and classes and documents
that are further from one another are less related. The user then may
select one of documents 238 to display its text or may select one of
classes 24, 26 and 28 to display a lower level of the hierarchy. As
described above with respect to FIG. 3, class 24, which is designated
"Science", has three subclasses 30, 32 and 34, which are designated
"Astronomy", "Chemistry" and "Electronics", respectively. Therefore,
selecting class 24 causes those retrieved documents 238 that relate
to class 24 to be displayed, along with the subclasses 30, 32 and 34
of class 24, as illustrated in FIG. 12. As described above, documents
238 that relate to class 24 have positions relative to one another
that represent the extent to which they are related to one another,
and have positions relative to subclasses 30, 32 and 34 that
represent the extent to which they are related to those subclasses.

The step 202 (FIG. 4) of positioning a classified document d in the
semantic space map comprises the steps illustrated in FIG. 10. At
step 228, the four highest class scores, P_{c1,d}, P_{c2,d}, P_{c3,d}
and P_{c4,d}, are selected. Document d is positioned in response to
the weighted positions of the four classes to which the selected
scores correspond. At step 230, each class score is adjusted to
emphasize the higher scores relative to the lower scores. The highest
score, P_{c1,d}, is adjusted by setting it equal to the difference
between it and the lowest score, P_{c4,d}, with the difference raised
to a power K greater than one. The next highest score, P_{c2,d}, is
adjusted by setting it equal to the difference between it and the
lowest score, P_{c4,d}, with the difference raised to the same power
K. The next highest score, P_{c3,d}, is adjusted by setting it equal
to the difference between it and the lowest score, P_{c4,d}, with the
difference raised to the same power K. The lowest score, P_{c4,d}, is
not adjusted. At step 232, the coordinates of the point at which
document d is positioned are determined. The point (x_d, y_d) is the
weighted mean of the positions of the classes (x_c, y_c)
corresponding to the four selected scores, as determined at step 108
(FIG. 4).

Returning to FIG. 4, the populated semantic space map 234 comprises
the positions of the classes (x_c, y_c) and the positions of the
documents (x_d, y_d). Step 34 of displaying the classes and retrieved
documents comprises the step 236 of retrieving one or more documents
238 from a document storage and retrieval system and the step 240 of
looking up the positions of the classes and the retrieved documents
in the semantic space map. A user inputs a query 242 using any
suitable query interface known in the art. For example, the query may
be a simple Boolean expression such as "Big Bang and history," or it
may be a complex natural language expression:

    What is the history of the Big Bang theory? Describe the
    recent experiments confirming the Big Bang theory.
    Has the Big Bang theory had any impact on religion or
    philosophy?

Step 236 may retrieve documents in any suitable manner in response to
query 242. Step 236 may search all documents, including both new
documents 204 added to the database and the initial
manually-classified documents 40, as indicated by the dashed line in
FIG. 4.

Although all retrieved documents 238 and classes may be displayed
immediately in response to query 242, in a preferred embodiment the
classes alone are first displayed. The classes, i.e., the tokens or
icons representing the classes on video display 16, may be modified
or supplemented with information to convey to the user that retrieved
documents relate to those classes. The user may select a class to
display the related documents. The user may then select a document to
display its text.

Obviously, other embodiments and modifications of the present
invention will occur readily to those of ordinary skill in the art in
view of these teachings. Therefore, this invention is to be limited
only by the following claims, which include all such other
embodiments and modifications when viewed in conjunction with the
above specification and accompanying drawings.

We claim:

1. A method for visually representing the semantic relatedness
between a predetermined plurality of classes and a plurality of
documents, each document stored in a computer document retrieval
system, said plurality of documents collectively comprising a
plurality of terms in computer-readable format, each document having
a tag representing the topical relatedness of said document to each
said class, said method comprising the steps of:

generating a semantic space map in response to said terms
and said tag of each document of said plurality of
documents, said semantic space map representing the
position in a plurality of dimensions of each class
relative to every other said class, the spatial distance
between said position of said each class and said every
other class corresponding to the semantic relatedness of
said each class to said every other class;

populating said semantic space map in response to said
plurality of documents, said populated semantic space
map representing the position in a plurality of dimensions
of each class relative to every other class and of
each document relative to each class, the spatial distance
between said position of said each document
relative to said each class corresponding to the semantic
relatedness of said each document to said each class;
and

displaying a visual representation of at least a portion of
said populated semantic space map.

2. The method claimed in claim 1, wherein said step of 
generating a semantic space map comprises the steps of: 

generating a semantic association between each said class
and every other said class in response to said plurality
of documents; and

multidimensionally scaling said plurality of classes in
response to said semantic association between each
said class and every other said class.

3. The method claimed in claim 2, wherein said step of
generating a semantic space map further comprises the step
of generating term frequency statistics in response to said
plurality of documents.

4. The method claimed in claim 3, wherein said step of
generating a semantic association comprises the step of
generating a plurality of class conditional probability
distributions, each corresponding to one of said classes and
each comprising the probability of each said term occurring
in said one of said classes.

5. The method claimed in claim 4, wherein said step of
generating a semantic association comprises the steps of 
determining the Chi-Squared measure of distance between 
each class conditional probability distribution and every 
other said class conditional probability distribution. 

6. The method claimed in claim 1, wherein said step of
populating said semantic space map in response to said 
plurality of documents comprises the steps of: 

producing a set of class scores for each said document 
using a statistical classifier to produce a classified 
document in response to said terms in said document;
and 

positioning said classified document in said plurality of 
dimensions. 

7. The method claimed in claim 6, wherein: 

said statistical classifier is trained in response to said
plurality of documents; and 

said step of populating said semantic space map com- 
prises the steps of producing a set of class scores for a 
new document to produce a classified new document, 
and positioning said classified new document in said
plurality of dimensions. 

8. The method claimed in claim 6, wherein said step of
producing a set of class scores comprises the step of per- 
forming a maximum likelihood estimation. 

9. The method claimed in claim 8, wherein:

said step of generating a semantic association between
each said class and every other said class comprises the
step of computing the probability and frequency of
each said term occurring in each said class; and

said set of class scores is computed in response to said
probability of each said term occurring in each said
class.

10. The method claimed in claim 9, wherein: 

if the probability of a term occurring in a class is nonzero, 
each class score is computed by summing, over said 
terms in each said document, the logarithm of the 
quotient of the frequency of said term occurring in said 
class divided by the sum of the sum of the frequencies 
of all terms occurring in said class; and

if the probability of a term occurring in a class is zero,
each class score is computed by summing, over said
terms in each said class, the logarithm of the quotient
of a predetermined constant divided by the sum of the
frequencies of all terms occurring in said class.

11. The method claimed in claim 10, wherein said
predetermined constant is greater than or equal to 0.001 and less
than or equal to 1.0. 

12. The method claimed in claim 6, wherein said step of 
positioning said classified document comprises the steps of: 

selecting a predetermined number of class scores in each 
set, said selected class scores being greater than all
other said class scores in said set; 

normalizing said selected class scores to produce normal- 
ized class scores; and 

determining the weighted average of said positions of said 
classes corresponding to said normalized class scores 
using said normalized class scores as weights. 

13. The method claimed in claim 1, wherein said step of 
displaying a visual representation of at least a portion of said 
populated semantic space map comprises the steps of: 

querying a document retrieval system to produce a set of 
located documents; and 

displaying a visual representation of each said class and a 
visual representation of each said located document in 
response to said populated semantic space map. 

14. The method claimed in claim 13, wherein: 

the spatial distance between the visual representation of 
each class and the visual representation of every other 
said class corresponds to the semantic relatedness of 
said class to said every other said class; and 

the spatial distance between the visual representation of 
each located document and the visual representation of 
every other said located document approximately cor- 
responds to the semantic relatedness of said located 
document to said every other said located document.

15. A system for displaying a visual representation of 
documents and classes to which said documents relate, 
comprising: 

a document retrieval subsystem having a user input device 
for receiving a user query, a user output device for 
displaying graphical representations of documents and 
a predetermined plurality of classes, and memory for 
storing said documents, each said document having a 
topical association with one of said classes, said docu- 
ment retrieval system providing retrieved documents in 
response to said user query; 

a classification subsystem for computing a semantic relat- 
edness between each class and every other one of said 
classes and for producing a set of class scores for each 
stored document, each class score in said set represent- 
ing a semantic relatedness between said stored docu- 
ment and one of said classes; and 

a visualization subsystem for producing a semantic space 
map in response to said semantic relatedness between 
each class and every other one of said classes, for 
populating said semantic space map with said stored 
documents in response to said sets of class scores and 
for displaying a populated semantic space map on said 
user output device. 

16. A machine-readable computer data storage medium 
having stored therein a program, comprising: 

a term frequency statistics generator for generating term
frequency statistics in response to a plurality of pre- 
classified documents, each having a plurality of terms 
and each topically associated with a predetermined one 
of a predetermined plurality of classes; 



a semantic space map generator for generating a semantic 
association between each class of said plurality of 
classes and every other class of said plurality of classes 
in response to said term frequency statistics and topical 
association between each document and predetermined 
one of said plurality of classes; 

a multidimensional scaler for positioning said plurality of 
classes in a plurality of dimensions in a semantic space 
map in response to said semantic association between 
each said class and every other said class; 

a statistical classifier for producing a set of class scores for 
a document in response to frequencies of terms in said 
document and for positioning said document in said 
semantic space map in response to said set of class 
scores, said set of class scores representing the seman- 
tic association between said document and each said 
class; and 

a semantic space map populator for positioning said 
document corresponding to said set of class scores in 
said semantic space map. 

17. The data storage medium claimed in claim 16, 
wherein said semantic space map generator generates a 
plurality of class conditional probability distributions, each 
corresponding to one of said classes and each comprising the 
probability of each said term occurring in said one of said 
classes. 

18. The data storage medium claimed in claim 17, 
wherein said semantic space map generator computes said 
semantic association in response to the Chi-Squared
measure of distance between each class conditional probability
distribution and every other said class conditional probabil- 
ity distribution. 

19. The data storage medium claimed in claim 16, 
wherein said statistical classifier computes said set of class 
scores in response to a maximum likelihood estimation. 
20. The data storage medium claimed in claim 19,
wherein: 

if the probability of a term occurring in a class is nonzero, 
said statistical classifier computes each class score by 
summing, over said terms in each said pre-classified 
document, the logarithm of the quotient of the fre- 
quency of said term occurring in said class divided by 
the sum of the sum of the frequencies of all terms 
occurring in said class; and 

if the probability of a term occurring in a class is zero, said 
statistical classifier computes each class score by 
summing, over said terms in each said class, the 
logarithm of the quotient of a predetermined constant
divided by the sum of the frequencies of all terms 
occurring in said class. 

21. The data storage medium claimed in claim 20, 
wherein said predetermined constant is greater than or equal
to 0.001 and less than or equal to 1.0. 

22. The data storage medium claimed in claim 16, 
wherein: 

said semantic space map populator selects a predeter- 
mined number of class scores in each set, said selected 
class scores being greater than all other said class 
scores in said set; 

said semantic space map populator normalizes said 
selected class scores to produce normalized class 
scores; and 

said semantic space map populator computes a weighted 
average of said positions of said classes corresponding 
to said normalized class scores using said normalized 
class scores as weights. 
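
The class-score and positioning computations recited in claims 9 to 12
(and in claims 20 to 22) can be sketched in Python as follows. This is
an illustrative sketch only, not the patented implementation. The
function names, the dictionary layout of the term frequency statistics,
the default constant of 0.01, the exponent of 2.0, the example data, and
the rule of dividing the adjusted scores by their sum so that they act
as weights are all assumptions introduced here.

```python
import math


def class_scores(doc_terms, term_freq, k=0.01):
    """Return {class: score} for one document.

    Each score is the sum, over the document's terms, of
    log(F' / F_c), where F' is the term's frequency in the class
    (or the constant k when that frequency is zero) and F_c is the
    total frequency of all terms in the class.
    """
    scores = {}
    for cls, freqs in term_freq.items():
        total = sum(freqs.values())  # F_c
        score = 0.0
        for term in doc_terms:
            f = freqs.get(term, 0)
            modified = f if f > 0 else k  # zero frequency replaced by k
            score += math.log(modified / total)
        scores[cls] = score
    return scores


def position_document(scores, class_pos, n=4, power=2.0):
    """Weighted mean of the positions of the n best-scoring classes."""
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    lowest = scores[top[-1]]
    # Assumed normalization: subtract the lowest selected score and
    # raise the difference to a power greater than one.
    adjusted = {c: (scores[c] - lowest) ** power for c in top}
    total = sum(adjusted.values())
    if total == 0:  # all selected scores equal: use equal weights
        weights = {c: 1.0 / len(top) for c in top}
    else:
        weights = {c: a / total for c, a in adjusted.items()}
    x = sum(weights[c] * class_pos[c][0] for c in top)
    y = sum(weights[c] * class_pos[c][1] for c in top)
    return x, y


# Hypothetical term frequency statistics and class positions.
term_freq = {
    "Science": {"bang": 8, "theory": 10, "star": 6},
    "Religion": {"faith": 9, "theory": 2, "church": 7},
    "Mankind": {"history": 9, "theory": 3, "culture": 6},
}
class_pos = {
    "Science": (1.0, 0.0),
    "Religion": (-1.0, 0.0),
    "Mankind": (0.0, 1.0),
}

doc = ["bang", "theory", "theory", "star"]
scores = class_scores(doc, term_freq)
best = max(scores, key=scores.get)  # most probable class
x, y = position_document(scores, class_pos, n=3)
print(best, round(x, 3), round(y, 3))
```

Summing logarithms rather than multiplying probabilities keeps long
documents from underflowing. Under the normalization assumed here the
lowest of the selected classes receives zero weight, so the document is
drawn toward its best-matching classes.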